Customer Analysis Of 1 Mg Mall

Problem Statement:

1 Mg Mall, Bangalore approached us to gain better insight into their customers. For this analysis they provided a dataset that they maintain, which contains demographic information and behavioural data about their customers.

Objective:

The dataset contains each customer's annual income and spending score (a number from 1 to 100 that indicates the customer's spending ability). We will use clustering algorithms to segment the customers and then analyse the clusters to understand the customer base.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
In [2]:
#dataset = pd.read_excel('Tenovia2.xlsx')
dataset = pd.read_csv('Tenovia2.csv')
dataset.head()
Out[2]:
CustomerID Gender Age Annual Income (INR) Spending Score (1-100)
0 1 Male 19 15 39
1 2 Male 21 15 81
2 3 Female 20 16 6
3 4 Female 23 16 77
4 5 Female 31 17 40
In [3]:
dataset.drop(['CustomerID'],axis = 1, inplace = True)

Gender diversity in the Mall

In [4]:
genders = dataset.Gender.value_counts()
sns.set_style("darkgrid")
plt.figure(figsize=(10,4))
sns.barplot(x=genders.index, y=genders.values, palette="Blues_d")
plt.show()

The bar plot shows that 1 Mg Mall has more female customers than male customers.
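The bar plot gives the counts only visually. As a minimal sketch, the same comparison can be read off as numbers with `value_counts`. The `sample` frame below is a made-up stand-in, not the mall data.

```python
import pandas as pd

# Hypothetical stand-in for the mall data; only the Gender column matters here.
sample = pd.DataFrame({"Gender": ["Female", "Male", "Female", "Female", "Male"]})

# Absolute counts per gender, the same quantity the bar plot draws.
counts = sample["Gender"].value_counts()

# Relative shares: normalize=True divides each count by the total.
shares = sample["Gender"].value_counts(normalize=True)

print(counts)
print(shares.round(2))
```

On the real data, the same two calls on `dataset["Gender"]` would give the exact counts and percentages behind the plot.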

In [5]:
dataset.describe()
Out[5]:
Age Annual Income (INR) Spending Score (1-100)
count 200.000000 200.000000 200.000000
mean 38.850000 60.560000 50.200000
std 13.969007 26.264721 25.823522
min 18.000000 15.000000 1.000000
25% 28.750000 41.500000 34.750000
50% 36.000000 61.500000 50.000000
75% 49.000000 78.000000 73.000000
max 70.000000 137.000000 99.000000
In [6]:
plt.figure(figsize=(15,6))
plt.subplot(1,2,1)
sns.boxplot(y=dataset["Spending Score (1-100)"], color="red")
plt.subplot(1,2,2)
sns.boxplot(y=dataset['Annual Income\n(INR)'], color="blue")
plt.show()

From the summary table and the box plots above, the middle 50% of 1 Mg Mall's customers (the range between the 25% and 75% quartiles) are aged about 28 to 49, have an annual income of about 41 to 78 INR, and have a spending score of about 34 to 73 (out of 100).
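The ranges quoted here are the 25% and 75% quartiles, which are also the edges of the box in a box plot. A small sketch of how they can be computed directly, using made-up ages rather than the customer data:

```python
import pandas as pd

# Hypothetical ages, not the real customer data.
ages = pd.Series([18, 22, 28, 30, 36, 40, 49, 55, 70], name="Age")

# First and third quartiles: the lower and upper edges of the box in a box plot.
q1, q3 = ages.quantile([0.25, 0.75])

# Customers whose age falls inside the interquartile range.
middle = ages[(ages >= q1) & (ages <= q3)]

print(f"Interquartile range: {q1} to {q3}")
print(f"{len(middle)} of {len(ages)} customers fall inside it")
```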

In [7]:
plt.figure(figsize=(20,10))
x = dataset['Annual Income\n(INR)']
y = dataset['Age']
z = dataset['Spending Score (1-100)']

# pass x and y as keywords; recent seaborn versions reject positional data arguments
sns.lineplot(x=x, y=y, color = 'green')
sns.lineplot(x=x, y=z, color = 'orange')
plt.title('Annual Income vs Age and Spending Score', fontsize = 20)
plt.show()

In this plot, the green line shows how Age varies with Annual Income (INR), and the orange line shows how the Spending Score varies with Annual Income.

In [8]:
plt.figure(figsize=(20,10))
sns.boxplot(
    data=dataset,
    x='Age',
    y='Spending Score (1-100)',
    color='blue')
plt.title('Age vs Spending Score', fontsize = 20)
Out[8]:
Text(0.5, 1.0, 'Age vs Spending Score')

The box plot above shows the distribution of spending scores for customers of each age.

Finding the relation between Age, Annual Income and Spending Score

In [9]:
plt.figure(1,figsize=(15,7))
n=0
for x in ['Age','Annual Income\n(INR)','Spending Score (1-100)']:
    for y in ['Age','Annual Income\n(INR)','Spending Score (1-100)']:
        n+=1
        plt.subplot(3,3,n)
        plt.subplots_adjust(hspace=0.5,wspace=0.5)
        sns.regplot(x=x,y=y,data=dataset)
        # shorten the label to its first two words, e.g. 'Annual Income'
        plt.ylabel(' '.join(y.split()[:2]))
        
plt.show()

In the plots above we can see how Age, Annual Income and Spending Score vary with one another.
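The regression plots show the direction of each relationship. A correlation matrix puts a number on its strength. The sketch below uses synthetic data with simplified column names (both are assumptions for illustration): spending is constructed to fall with age, while income is independent noise.

```python
import numpy as np
import pandas as pd

# Hypothetical data: spending is built to decrease with age, income is unrelated noise.
rng = np.random.default_rng(0)
age = rng.integers(18, 71, size=200)
toy = pd.DataFrame({
    "Age": age,
    "Income": rng.normal(60, 25, size=200),
    "Spending": 100 - age + rng.normal(0, 5, size=200),
})

# Pearson correlation for every pair of numeric columns:
# values near +1 or -1 mean a strong linear relation, values near 0 mean none.
corr = toy.corr()
print(corr.round(2))
```

Calling `.corr()` on the three numeric columns of `dataset` would give the corresponding table for the mall customers.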

K-Means clustering

In [10]:
X = dataset.iloc[:,1:].values

#Using the elbow method to find the optimum number of clusters
from sklearn.cluster import KMeans

wcss = []

for i in range(1,11):
    km=KMeans(n_clusters=i,init='k-means++', max_iter=500, n_init=10, random_state=0)
    km.fit(X)
    wcss.append(km.inertia_)
    
plt.figure(figsize=(8,8))
plt.plot(range(1,11),wcss,'r',marker='o', markersize=10)
plt.axvline(5, ls="--", c="b")
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('Within-cluster sum of squares')
plt.show()

Based on the elbow plot above, we can choose between 4 and 6 clusters.
Since the elbow is not very clear, let us visualize the clusters with the middle value, i.e. 5.
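When the elbow is ambiguous, the silhouette score is a common second opinion: it measures how well each point sits inside its own cluster compared with the nearest other cluster, and higher is better. This notebook does not use it, so the following is only a sketch on synthetic blobs (`X_demo` and the chosen centres are assumptions).

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Hypothetical data: 200 points around 5 well-separated centres.
centres = [[0, 0], [10, 0], [0, 10], [10, 10], [5, 5]]
X_demo, _ = make_blobs(n_samples=200, centers=centres, cluster_std=0.6, random_state=0)

# Silhouette score needs at least 2 clusters, so start the scan at k=2.
scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

# The k with the highest average silhouette is the candidate to prefer.
best_k = max(scores, key=scores.get)
for k, s in scores.items():
    print(k, round(s, 3))
print("best k:", best_k)
```

Replacing `X_demo` with the customer feature matrix `X` would let the candidates 4, 5 and 6 be compared on one number instead of by eye.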

In [11]:
X= dataset[['Age', 'Annual Income\n(INR)', 'Spending Score (1-100)']]

# initialise and fit K-Means model
KM_5_clusters = KMeans(n_clusters=5, init='k-means++').fit(X) 
KM5_clustered = X.copy()

# append labels to points
KM5_clustered.loc[:,'Cluster'] = KM_5_clusters.labels_ 
In [12]:
fig1, (axes) = plt.subplots(1,2,figsize=(12,5))


# x and y are passed as keywords; recent seaborn versions reject positional data arguments
scat_1 = sns.scatterplot(x='Annual Income\n(INR)', y='Spending Score (1-100)', data=KM5_clustered,
                hue='Cluster', ax=axes[0], palette='Set1', legend='full')

sns.scatterplot(x='Age', y='Spending Score (1-100)', data=KM5_clustered,
                hue='Cluster', palette='Set1', ax=axes[1], legend='full')

axes[0].scatter(KM_5_clusters.cluster_centers_[:,1],KM_5_clusters.cluster_centers_[:,2], marker='s', s=40, c="blue")
axes[1].scatter(KM_5_clusters.cluster_centers_[:,0],KM_5_clusters.cluster_centers_[:,2], marker='s', s=40, c="blue")
plt.show()

K-Means algorithm generated the following 5 clusters:

  1. Customers with low annual income and high spending score
  2. Customers with medium annual income and medium spending score
  3. Customers with high annual income and low spending score
  4. Customers with high annual income and high spending score
  5. Customers with low annual income and low spending score
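Descriptions like these are read off the scatter plots, but they can also be checked numerically by averaging each feature per cluster. A minimal sketch with a made-up two-group dataset (the `toy` frame and its column names are assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical customers in two obvious groups (income, spending score).
toy = pd.DataFrame({
    "Income": [15, 16, 17, 18, 85, 87, 88, 90],
    "Spending": [80, 78, 82, 79, 12, 15, 10, 14],
})

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy)
toy["Cluster"] = km.labels_

# Mean of each feature per cluster: these values are what justify a name
# such as "low income, high spending" for a given cluster number.
profile = toy.groupby("Cluster")[["Income", "Spending"]].mean()
print(profile)
```

The same `groupby('Cluster').mean()` on `KM5_clustered` would show which cluster number corresponds to which description in this run.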
In [13]:
import plotly as py
import plotly.graph_objs as go

def tracer(db, n, name):
    
    return go.Scatter3d(
        x = db[db['Cluster']==n]['Age'],
        y = db[db['Cluster']==n]['Spending Score (1-100)'],
        z = db[db['Cluster']==n]['Annual Income\n(INR)'],
        mode = 'markers',
        name = name,
        marker = dict(
            size = 5
        )
     )

e0 = tracer(KM5_clustered, 0, 'low annual income and high spending')
e1 = tracer(KM5_clustered, 1, 'medium annual income and medium spending')
e2 = tracer(KM5_clustered, 2, 'high annual income and low spending')
e3 = tracer(KM5_clustered, 3, 'high annual income and high spending')
e4 = tracer(KM5_clustered, 4, 'low annual income and low spending')

data = [e0, e1, e2, e3, e4]

layout = go.Layout(
    title = 'Clusters by K-Means',
    scene = dict(
            xaxis = dict(title = 'Age'),
            yaxis = dict(title = 'Spending Score'),
            zaxis = dict(title = 'Annual Income')
        )
    #,width=1000,
    #height=1000
)

fig = go.Figure(data=data, layout=layout)
py.offline.iplot(fig)
In [14]:
KM_clust_sizes = KM5_clustered.groupby('Cluster').size().to_frame()
KM_clust_sizes.columns = ["KM_size"]
KM_clust_sizes
Out[14]:
KM_size
Cluster
0 80
1 36
2 39
3 23
4 22

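The largest cluster in the table above is cluster 0 with 80 observations, and the smallest is cluster 4 with 22 observations. K-Means numbers its clusters arbitrarily on each run, so a cluster number has to be matched to its description by checking the centroids; in this run the largest group was read as the high annual income and low spending clients, and the smallest as the low annual income and low spending clients.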
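Cluster numbers carry no meaning of their own: without a fixed `random_state`, K-Means may number the same groups differently from one run to the next. The sketch below, on made-up points, shows that a fixed seed reproduces the labels and that two differently seeded runs can be compared with the adjusted Rand index, which ignores the numbering.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# Hypothetical 2-D points forming three tight groups of 30 points each.
rng = np.random.default_rng(0)
centres = np.array([[0, 0], [10, 0], [5, 10]])
points = np.vstack([c + rng.normal(0, 0.5, size=(30, 2)) for c in centres])

run_a = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(points)
run_b = KMeans(n_clusters=3, n_init=10, random_state=2).fit_predict(points)
run_a_again = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(points)

# Same seed: the label numbers are reproduced exactly.
same_seed_identical = np.array_equal(run_a, run_a_again)

# Different seeds may renumber the clusters; the adjusted Rand index compares
# the groupings themselves and equals 1.0 when the partitions are identical.
agreement = adjusted_rand_score(run_a, run_b)

# Cluster sizes, sorted so they do not depend on the numbering.
sizes = sorted(np.bincount(run_a))
print(same_seed_identical, agreement, sizes)
```

Passing `random_state` to the `KMeans` calls in this notebook would keep the cluster numbers, the trace names and the size table aligned between runs.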

Since the largest group came out as "high annual income and low spending" clients, the result does not look accurate or satisfactory to me. So I am going to check with 6 clusters.

In [15]:
KM_6_clusters = KMeans(n_clusters=6, init='k-means++').fit(X) 

KM6_clustered = X.copy()
KM6_clustered.loc[:,'Cluster'] = KM_6_clusters.labels_
In [16]:
fig2, (axes) = plt.subplots(1,2,figsize=(12,5))

# x and y are passed as keywords; recent seaborn versions reject positional data arguments
sns.scatterplot(x='Annual Income\n(INR)', y='Spending Score (1-100)', data=KM6_clustered,
                hue='Cluster', ax=axes[0], palette='Set1', legend='full')

sns.scatterplot(x='Age', y='Spending Score (1-100)', data=KM6_clustered,
                hue='Cluster', palette='Set1', ax=axes[1], legend='full')

# plotting centroids
axes[0].scatter(KM_6_clusters.cluster_centers_[:,1], KM_6_clusters.cluster_centers_[:,2], marker='s', s=40, c="blue")
axes[1].scatter(KM_6_clusters.cluster_centers_[:,0], KM_6_clusters.cluster_centers_[:,2], marker='s', s=40, c="blue")
plt.show()

K-Means algorithm generated the following 6 clusters:

  1. younger clients with medium annual income and medium spending score
  2. clients with high annual income and low spending score
  3. older clients with medium annual income and medium spending score
  4. clients with high annual income and high spending score
  5. clients with low annual income and low spending score
  6. clients with low annual income and high spending score
In [17]:
KM6_clust_size = KM6_clustered.groupby('Cluster').size().to_frame()
KM6_clust_size.columns = ["KM_size"]
KM6_clust_size
Out[17]:
KM_size
Cluster
0 35
1 45
2 39
3 22
4 21
5 38

With 6 clusters, the largest cluster is cluster 1 with 45 observations (customers with medium annual income and medium spending score), and the smallest is cluster 4 with 21 observations (customers with low annual income and low spending score).

So, for this dataset, the K-Means algorithm with 6 clusters provides the better customer analysis.